To assess the codon evolution in virus-host systems, Avian coronavirus and its natural host Gallus gallus
were used as a model. Codon usage (CU) was measured for the viral spike (S), nucleocapsid (N),
non-structural protein 2 (NSP2) and papain-like protease (PLpro) genes from a diverse set of A. coronavirus
lineages and for G. gallus genes (lung surfactant protein A, intestinal cholecystokinin, oviduct ovomucin
alpha subunit, kidney vitamin D receptor and the ubiquitous beta-actin) for different A. coronavirus
replicating sites. Relative synonymous codon usage (RSCU) trees accommodating all virus and host genes in a
single topology showed a higher proximity of A. coronavirus CU to the respiratory tract for all genes. The
codon adaptation index (CAI) showed a lower adaptation of S to G. gallus compared to NSP2, PLpro and
N. The effective number of codons (Nc) and GC3% revealed that natural selection and genetic drift are the
evolutionary forces driving the codon usage evolution of both A. coronavirus and G. gallus regardless of
the gene being considered. The spike gene showed only one 100% conserved amino acid position coded
by an A. coronavirus preferred codon, a significantly low number when compared to the three other genes
(p < 0.0001). Virus CU evolves independently for each gene in a manner predicted by the protein function,
with a balance between natural selection and mutation pressure, giving further molecular basis for the
viruses' ability to exploit the host's cellular environment in a concerted virus-host molecular evolution.
1. Introduction


Codon usage (CU) refers to the frequency of the occurrence of 
each codon for at least two-fold degenerate codons (Hershberg and 
Petrov, 2008), i.e., it is an indication of the ‘preference’ of a genome 
for one or more codons if more than one codon is possible for the 
same amino acid. 

Natural selection for efficient protein synthesis and folding,
and genetic drift based on mutation pressure, which leads to
a homogeneous GC% in the genome and at the 3rd codon position, are the
most evident forces acting on codon usage evolution that could lead to
detectable codon usage bias (CUB) (Yang and Nielsen, 2008), which
has been increasingly used in studies on virus and host molecular
evolution.

Avian coronavirus (Nidovirales: Coronaviridae: Coronavirinae: 
Gammacoronavirus), which originated approximately 4800 years 
ago (Woo et al., 2012) and has a large number of serotypes and 
genotypes, primarily infects the respiratory tract of laying hens, 
broilers and breeders but can also infect the kidneys, intestines 
and reproductive tracts of both females and males (Cook et al., 
2012), depending on the pathotype. Though affinity to different
classes of cell membrane glycans could be one of the explanations 
for the existence of the different viral pathotypes (Wickramasinghe 
et al., 2011), the exact mechanism for this level of diversity is still 
unknown. 

The 27.6 kb single-stranded positive sense RNA of A. coronavirus
encodes 23 proteins, and the first two-thirds of the genome contains
ORF 1, which encodes 15 non-structural proteins involved
in RNA transcription and replication (Masters, 2006; Ziebuhr and
Snijder, 2007). Among these, the papain-like protease (PLpro) is the
proteolytic processor of the N-proximal domain of polyproteins
pp1a and pp1ab (Ziebuhr et al., 2000). Non-structural protein 2
(NSP2), the first in ORF 1 because A. coronavirus lacks NSP1, has
a still undefined role, though a role in global RNA synthesis has
been suggested (Graham et al., 2005).

Of the structural proteins, the spike glycoprotein (S) has a strong 
interaction with the host immune system and is so highly poly- 
morphic that mutations in only 10 amino acids on the amino 
terminal ectodomain (S1) could result in the loss of cross-reactivity 
(Cavanagh, 2007). While S1 allows the virus to attach to α2,3Sia,
which is widespread in chicken cells (Winter et al., 2008), the 
carboxy terminal S2 has the capacity to fuse virus-to-cell and cell- 
to-cell membranes (Masters, 2006). 

The nucleocapsid (N) protein binds to the genomic RNA due to its 
positively charged amino acid domains, and though under a more 
strict mutation constraint than S, positive selection plays a role in
N evolution (Kuo et al., 2013; Masters, 2006). 

The codon usage of A. coronavirus has been reported to be highly 
to moderately biased but closer to that found in the respiratory 
tract of Gallus gallus when compared to other tissues (Brandao, 
2012). However, that report was limited because codon usage was 
measured based on only the spike gene. 

The aim of this study was to assess the evolution of codon usage 
in viral structural and non-structural genes and their molecular 
relationship with host codon usage using A. coronavirus and its 
natural host G. gallus as a model. 


2. Materials and methods 
2.1. Sequences 


2.1.1. A. coronavirus

For A. coronavirus, sequences were chosen to promote diver- 
sity of geographic origin and serotype/genotypes, including the 
archetypical strains, with an effort to keep the same datasets if pos- 
sible. Because the number of complete genomes and genes for A. 
coronavirus available in GenBank did not allow for the representa- 
tion of such diversity, only partial genes were used in this study 
instead of complete ones to have the most diverse dataset possi- 
ble. As the accuracy of codon usage measurements is lower for short 
sequences, sequences <100 codons in length (Roth et al., 2012) were 
not included. Sequence redundancy was avoided by keeping only 
one sequence if any 100% nucleotide identity was found. 

Following these criteria, this study included 64 S protein 
sequences, codons 1-169 (14.6% of the 1162 S codons); 25 N protein 
sequences, codons 301-409 (26.7% of the 409 N codons); 18 NSP2 
sequences, codons 1-245 (36.4% of the 673 NSP2 codons); and 15 
papain-like protease sequences, codons 3-437 (99.5% of the 437 
PLpro codons). The accession numbers are shown in Fig. 1. All indicated
positions are relative to the complete genome of the Avian
infectious bronchitis virus strain M41 (DQ834384.1). 


2.1.2. G. gallus 

Aiming to assess the codon usage of the different tissues 
in which A. coronavirus replicates in chicken, non-redundant 
complete codon sequences were retrieved from the GenBank 
database and from the G. gallus genome project for chole- 
cystokinin, expressed in the duodenum (NM_001001741.1 and 
GFC_000002315.3); lung surfactant pulmonary-associated protein
A1 (SFTPA1), expressed in the lungs (NM_204606.1 and
GFC_000002315.3); vitamin D receptor, expressed in the kidneys 
(NM_205098.1 and GFC_000002315.3); and ovomucin alpha sub- 
unit, expressed in the oviduct (AB046524.1 and GFC_000002315.3). 
As a reference, the complete G. gallus beta-actin gene (L08165 and
GFC_000002315.3) was included in the analyses as a ubiquitously 
expressed gene. 

All sequences used in this study can be found in Supplementary 
material 1. 


2.2. Relative synonymous codon usage (RSCU) 


RSCU, the relationship between the observed and the expected
frequency of a codon if the synonymous codon usage is random
(Roth et al., 2012), was calculated for 59 codons, excluding the
single codons of methionine and tryptophan and the three stop
codons, using the equation $RSCU_i = X_i / (\sum_i X_i / m)$ (Nei and Kumar,
2000), where $X_i$ is the total count for a given codon, $\sum_i X_i$ is the
sum of the counts for all synonymous codons regarding the amino
acid under consideration and $m$ is the number of possible isoacceptors
for that amino acid, implemented in MEGA 5.0 (Tamura et al.,
2011).
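The RSCU definition above can be illustrated with a minimal Python sketch. It is not the MEGA 5.0 implementation; the function name `rscu`, the choice to skip amino acids absent from the sequence, and the use of the standard genetic code are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product

# Standard genetic code in TCAG order: codon -> one-letter amino acid ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(seq):
    """Return {codon: RSCU} for the 59 degenerate sense codons seen in seq.

    RSCU_i = X_i / (sum_i X_i / m), with m the number of synonymous codons.
    Amino acids that do not occur in the sequence are skipped (assumption).
    """
    seq = seq.upper().replace("U", "T")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group codons into synonymous families.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:  # stops, Met and Trp are excluded
            continue
        total = sum(counts[c] for c in family)
        if total == 0:
            continue
        m = len(family)
        for c in family:
            values[c] = counts[c] / (total / m)
    return values
```

For a toy sequence with two UUU and one UUC codons, the expected count per codon under random usage is 3/2, so the RSCU of UUU is 2/1.5 and that of UUC is 1/1.5.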


The continuous RSCU values from A. coronavirus and G. gallus
genes were converted to binary data using the value 1 for RSCUs
>1, when a given codon was preferred for a specific amino acid, or 0
for RSCUs ≤1, when the codon was not preferred (RSCU <1) or was
neutral (RSCU = 1). Finally, the combined dataset of the four viral
and five host genes was used to build a binary matrix of 59 characters × 132
sequences (Supplementary material 2) for the presence or
absence of a preferred codon, which was used to build a neighbor-joining
tree (1000 bootstrap replicates) using PAUP, version 4.1b
(Swofford, 2000).
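The conversion of RSCU values into one row of the binary matrix can be sketched as follows. The function name `binarize_rscu` is hypothetical, and coding a codon whose amino acid is absent from the sequence as 0 is an assumption; the text does not state how such cases were handled.

```python
def binarize_rscu(rscu_values, codon_order):
    """Return one row of the presence/absence matrix.

    A codon scores 1 when its RSCU is > 1 (preferred) and 0 when its RSCU
    is <= 1 (not preferred or neutral). Codons missing from rscu_values
    are scored 0, which is an assumption of this sketch.
    """
    return [1 if rscu_values.get(codon, 0.0) > 1.0 else 0 for codon in codon_order]


# Toy example with four characters instead of the 59 used in the study.
example_row = binarize_rscu(
    {"TTT": 1.33, "TTC": 0.67, "GAA": 1.0},
    ["TTT", "TTC", "GAA", "GAG"],
)
```

Stacking one such row per sequence, with the same codon order throughout, gives a sequences × characters matrix of the kind described above.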


2.3. Codon adaptation index (CAI) 


The CAI is a measure of codon usage derived from the geometric
mean of the relative codon adaptiveness for each codon based on
a set of translationally optimal codons used as a reference (Roth
et al., 2012) and can be calculated according to the equation

$$CAI_g = \prod_{k=1}^{61} w_k^{X_k}$$

Here, $w_k$ is the relative adaptiveness of the kth codon (61 codons;
the three stop codons were excluded), and $X_k$ is the fraction of the
codon k relative to the total number of codons in the gene.

Values closer to 1 indicate a high fitness in terms of codon usage 
for a given codon sequence in relation to the reference system 
(Sharp and Li, 1987), i.e., a high adaptation of viral genes to the 
host. 

The CAI was calculated for sequences from both A. coronavirus 
and G. gallus using a reference set of highly expressed G. gallus genes 
available in the ACUA 1.0 software (Vetrivel et al., 2007). 
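As a minimal sketch of the CAI calculation (not the ACUA 1.0 implementation), the product of $w_k^{X_k}$ can be evaluated in log space. The function name `cai` and the toy weights below are hypothetical; they are not the G. gallus reference set, and codons without a reference weight are skipped as an assumption.

```python
import math
from collections import Counter


def cai(seq, w):
    """Codon adaptation index: prod_k w_k ** X_k, computed in log space.

    w maps codons to relative adaptiveness values in (0, 1].
    Codons that have no weight in w are ignored (assumption of this sketch).
    """
    seq = seq.upper().replace("U", "T")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    counts = Counter(c for c in codons if c in w)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no codon of the sequence has a reference weight")
    # X_k = counts[k] / total, so sum_k X_k * ln(w_k) is the log of the CAI.
    log_cai = sum(n * math.log(w[c]) for c, n in counts.items()) / total
    return math.exp(log_cai)


# Hypothetical weights for illustration only.
toy_weights = {"TTT": 0.5, "TTC": 1.0, "GAA": 1.0, "GAG": 0.25}
```

A sequence made only of codons with weight 1 has a CAI of 1, and every suboptimal codon lowers the geometric mean.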


2.4. Effective number of codons (Nc) 


Nc is a measure of the total number of different codons present 
in a sequence and shows the bias from equal use of all synony- 
mous codons for a given amino acid, with each synonymous codon 
treated as an allele as in the calculation of the effective number of 
alleles in population genetics (Roth et al., 2012). Nc values range 
from 20 to 61, with values closer to 61 indicating a lower bias 
(Wright, 1990). 

Nc was calculated according to the equation
$N_c = 2 + (9/F_2) + (1/F_3) + (5/F_4) + (3/F_6)$, where F is the average
homozygosity for equal use of each synonymous codon for each
class of degeneracy ranging from 2 to 6, using ACUA 1.0 (Vetrivel
et al., 2007).
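The Nc equation can be sketched in Python as below. This is an illustration under stated assumptions, not the ACUA 1.0 code: the per-amino-acid homozygosity is taken as F = (n Σp² − 1)/(n − 1), amino acids seen fewer than two times are skipped, a missing three-fold class is replaced by the mean of F2 and F4, and the result is capped at 61. The function name is hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def effective_number_of_codons(seq):
    """Nc = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6 (F averaged per degeneracy class)."""
    seq = seq.upper().replace("U", "T")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    f_by_class = {2: [], 3: [], 4: [], 6: []}
    for family in families.values():
        k = len(family)                      # degeneracy of the amino acid
        n = sum(counts[c] for c in family)   # times the amino acid occurs
        if k == 1 or n < 2:
            continue
        sum_p2 = sum((counts[c] / n) ** 2 for c in family)
        f_by_class[k].append((n * sum_p2 - 1) / (n - 1))

    mean_f = {k: sum(v) / len(v) for k, v in f_by_class.items() if v}
    if 3 not in mean_f and 2 in mean_f and 4 in mean_f:
        mean_f[3] = (mean_f[2] + mean_f[4]) / 2  # fallback when Ile is absent
    if set(mean_f) != {2, 3, 4, 6} or min(mean_f.values()) <= 0:
        raise ValueError("sequence too short or too sparse to estimate Nc")

    nc = 2 + 9 / mean_f[2] + 1 / mean_f[3] + 5 / mean_f[4] + 3 / mean_f[6]
    return min(nc, 61.0)
```

When every amino acid is coded by a single codon, all F values equal 1 and Nc reaches its minimum of 20; equal use of all synonymous codons drives Nc to the upper bound of 61.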


2.5. Codon selection test 


The expected effective number of codons (ENC), a measure of
codon usage affected only by the GC3% (the percentage of G or C
at the third position of all codons in a sequence) as a result of
mutation pressure and drift, was calculated using the equation
$ENC_{expec} = 2 + s + 29[s^2 + (1 - s)^2]^{-1}$ (Wright, 1990), where s is the
GC3 expressed as a fraction ranging from 0 to 1 (0 to 100%).

The ENC and simulated GC3% values were plotted as a
curve together with the Nc and observed GC3% values;
Nc × observed GC3% points lying on the ENC × simulated GC3% curve indicate
genetic drift/mutational bias, while points outside the curve indicate
natural selection (Wright, 1990).
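The two quantities used in this test can be sketched as follows, assuming s is given as a fraction between 0 and 1 and that GC3 is taken over all complete codons of the sequence. The function names are hypothetical.

```python
def gc3_fraction(seq):
    """Fraction of G or C at the third position of all complete codons."""
    seq = seq.upper().replace("U", "T")
    third_bases = seq[2::3]  # indices 2, 5, 8, ... are third codon positions
    if not third_bases:
        raise ValueError("sequence shorter than one codon")
    return sum(base in "GC" for base in third_bases) / len(third_bases)


def expected_enc(s):
    """Expected ENC under GC3-driven usage only: 2 + s + 29 / (s^2 + (1 - s)^2)."""
    return 2 + s + 29 / (s ** 2 + (1 - s) ** 2)


# Points of the expected curve for s = 0.00, 0.01, ..., 1.00.
curve = [(i / 100, expected_enc(i / 100)) for i in range(101)]
```

An observed (GC3, Nc) point can then be compared with `expected_enc(gc3)`: a point close to the curve is compatible with mutational bias alone, whereas a point well below it suggests an additional contribution of selection.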


2.6. Analysis of conserved amino acids coded by preferred codons


To assess the significance of each preferred codon on the
molecular evolution of A. coronavirus, 100% conserved amino acid
positions coded by the preferred codon(s), i.e., those with RSCUs >1,
were counted for each gene, and the significance of the differences
was assessed with Fisher's exact test and the odds ratio (OR).


2.7. Protein selection test


To understand the relationship between codon and protein
selection, the occurrence of purifying or positive selection on A.
coronavirus S, N, NSP2 and PLpro sequences was tested with Fisher's
exact test of neutrality for sequence pairs using the Nei-Gojobori
method (Nei and Gojobori, 1986) for the difference between the
synonymous and non-synonymous substitution distances (dS-dN)
using Mega 5 (Tamura et al., 2011).


3. Results


3.1. RSCU phylogeny


Fig. 1 shows that G. gallus RSCUs segregate in a tissue-specific
manner in a topology supported by bootstrap values of 100 for each
gene analyzed.
Fig. 1. Neighbor-joining distance tree for the relative synonymous codon usage (RSCU) for the Avian coronavirus spike (S), nucleocapsid (N), non-structural protein 2 (NSP2)
and papain-like protease (PLpro) genes and the Gallus gallus beta-actin, lung surfactant protein A (SFTPA1, gray background), intestinal cholecystokinin (CCK), oviduct ovomucin
alpha subunit (OSA) and kidney vitamin D receptor genes. The tree was based on binary data using the value 1 for RSCUs >1 (codon is preferred) or 0 for RSCUs ≤1, when
the codon is not preferred (RSCU <1) or is neutral (RSCU = 1). ENC (effective number of codons) values <40 and >45 are marked with an asterisk and a hash, respectively;
sequences with ENC values between 40 and 45 have no marks. The arrow indicates the separation between G. gallus and Avian coronavirus clusters. Numbers at each node
are bootstrap values (1000 replicates, only values >50 are shown). The bar represents the codon usage preferences distance.
For the A. coronavirus RSCUs, all genes segregated in gene-specific
clusters, except for the sequence EU526388.1 A2 PLpro,
which segregated closer to the NSP2 cluster. All strains segregated
in a cluster separated from G. gallus, with the internal nodes resulting
in the genotype-specific sub-clusters for the S gene, including
those for the archetypes Connecticut, Massachusetts and Arkansas,
with two sub-clusters and the PLpro cluster between them. No
pathotype-specific cluster was found.

Though the distinction between the A. coronavirus and G. gallus
RSCU clusters is also clear for the N, NSP2 and PLpro genes, a
less resolved topology emerges because the distinction among the
different genotypes is not sustained.

For all four genes, A. coronavirus clusters show an increasing dis- 
tance from the G. gallus clusters, with them being closer to SFTPA1 
(from the respiratory tract) and more distant from cholecystokinin 
(from the intestine) and with the ubiquitous beta-actin cluster 
being the most distant from both A. coronavirus and the other G. 
gallus clusters. 


3.2. Codon adaptation index (CAI) 


Mean CAI values for the A. coronavirus S, N, NSP2 and PLpro genes
were 0.66 (sd 0.01), 0.77 (sd 0.01), 0.69 (sd 0.01) and 0.7 (sd 0.01), 
respectively, while, for the G. gallus genes, the mean CAI was 0.81 
(sd 0.06), ranging from 0.71 for the pulmonary gene SFTPA1 to 0.88 
for the renal vitamin D receptor (mean values for two sequences). 

A boxplot representation of G. gallus and A. coronavirus CAIs
(Fig. 3) shows that, in relation to G. gallus, S has the lowest values
(0.64-0.7) and N has the highest values (0.75-0.79), while NSP2 and
PLpro have intermediate values (0.69-0.71), with non-overlapping
medians.


3.3. Effective number of codons (Nc) 


The mean Nc values for A. coronavirus S, N, NSP2 and PLpro were
43 (sd 2.31), 44.9 (sd 3.64), 51.33 (sd 1.56) and 43.79 (sd 0.86), 
respectively, and for G. gallus, the mean Nc values were 33.59 for 
vitamin D receptor, 40.03 for beta-actin, 46.48 for cholecystokinin, 
50.21 for SFTPA1 and 53.01 for ovomucin. 


3.4. Codon selection test 


The Nc × GC3% graphs (Fig. 2) show that, regardless of the A.
coronavirus gene under consideration (S, N, NSP2 or PLpro), all points
fall either just below or in the vicinity of the ENC × GC3% expected
curve. This same pattern was also found for the G. gallus genes,
though with points displaced to the right side of the graph due to a
higher GC3 content.


3.5. Analysis of conserved amino acids coded by preferred codons 


The number of 100% conserved amino acid positions coded by 
the preferred codons for genes S, N, NSP2 and PLpro was one, 20, 28
and 71, respectively. Fisher’s exact test showed that only the S gene 
presented a statistically significant lower number of occurrences 
(Table 1) when compared to the other 3 genes (p < 0.0001), with ORs 
Fig. 2. Four graphs showing the expected (seen in the curves of each graph) and observed (seen in the points of each graph) effective number of codons (ENC and Nc,
respectively) (Y axis) and the expected and observed GC3% (X axis) for (a) Avian coronavirus spike (S); (b) nucleocapsid (N); (c) non-structural protein 2 (NSP2) and (d)
papain-like protease (PLpro) (dots) and Gallus gallus beta-actin, lung surfactant protein A, intestinal cholecystokinin, oviduct ovomucin alpha subunit and kidney vitamin D
receptor (asterisks).
Table 1

Conserved amino acid (aa) positions in the Avian coronavirus spike (S), nucleocapsid (N), non-structural protein 2 (NSP2) and papain-like protease (PLpro) genes coded by a preferred codon and the preferred codon for each aa
in the Gallus gallus beta-actin (B-act), lung surfactant protein A (SFTPA1), intestinal cholecystokinin (CCK), oviduct ovomucin alpha subunit (Ovo) and kidney vitamin D receptor (VitD rec) genes. Tryptophan and methionine,
coded by a single codon, were excluded. Codon preference was indicated by relative synonymous codon usage (RSCU) >1. Positions are provided only for Avian coronavirus genes as G. gallus genes were used as the reference for
comparison.
Columns B-act to VitD rec: G. gallus; columns S to PLpro position: Avian coronavirus.

| aa | B-act | CCK | Ovo | SFTPA1 | VitD rec | S | S position | N | N position | Nsp2 | Nsp2 position | PLpro | PLpro position |
| F | UUC | UUC | UUU/UUC | UUU | UUC | UUU | 155 | UUU | 3313, 390 | UUU | 52, 64, 101, 138, 200, 217 | UUU | 28, 54, 57, 102, 144, 211, 220, 270, 326, 369, 393 |
| L | CUC/CUG | CUC/CUG | CUG | UUG/CUU/CUA/CUG | CUC/CUG | NC | NC | NC | NC | NC | NC | CUU | 16, 47, 58, 94, 111, 129, 273, 288, 315, 413 |
| I | AUU/AUC | AUC | AUU/AUC | AUU | AUC | NC | NC | AUU | 319, 397 | AUU | 23, 199, 210 | AUU | 52, 133, 194, 226, 383, 429 |
| V | GUG | GUG | GUU/GUG | GUU/GUG | GUC/GUG | NC | NC | NC | NC | GUU | 154, 163, 226, 228, 235 | GUU | 191, 192, 317, 323, 324, 371, 394, 430, 435 |
| S | UCU/UCC/AGC | UCC/AGC | UCU/UCC/UCA/AGU/AGC | UCU/AGU/AGC | UCC/AGC | NC | NC | UCA | 340, 344 | NC | NC | NC | NC |
| P | CCU/CCC | CCC | CCU/CCC/CCA | CCU | CCC | NC | NC | CCA | 338 | NC | NC | CCU | 178, 294, 338 |
| T | ACC/ACA | ACA | ACU/ACC/ACA | ACU/ACA | ACC/ACG | NC | NC | NC | NC | ACU | 123, 167, 241 | NC | NC |
| A | GCC | GCU/GCG | GCU/GCC/GCA | GCU/GCA | GCC | NC | NC | GCA | 376 | NC | NC | NC | NC |
| Y | UAC | UAC | UAU/UAC | UAU/UAC | UAC | NC | NC | UAU | 316 | NC | NC | NC | NC |
| H | CAC | CAC | CAU/CAC | CAU | CAC | NC | NC | NF | NF | NC | NC | CAU | 143, 201 |
| Q | CAG | CAG | CAA/CAG | CAA | CAG | NC | NC | CAG | 312, 369, 387 | NC | NC | NC | NC |
| N | AAC | All RSCUs = 1 | AAC | AAU | AAC | NC | NC | AAU | 315, 385, 407 | NC | NC | AAU | 27, 82, 97, 140, 186, 296, 343 |
| K | AAG | AAG | AAA | AAA | AAG | NC | NC | NC | NC | AAA | 6, 21, 86 | NC | NC |
| D | GAU | GAU | GAU/GAC | GAC | GAC | NC | NC | GAU | 314, 374 | NC | NC | GAU | 5, 105, 160, 176, 182, 184, 217, 258, 281 |
| E | GAG | All RSCUs = 1 | GAA | GAG | GAG | NC | NC | NC | NC | GAA | 98, 136, 142, 165 | GAA | 130, 164, 185, 342 |
| C | UGC | UGC | UGU | UGU | UGC | NC | NC | UGU | 320, 323 | UGU | 68, 242 | UGU | 132, 154, 183, 202, 439 |
| R | CGU/AGA | CGC/CGG/AGG | AGA/AGG | CGA/AGA | CGC/CGG/AGG | NC | NC | AGA | 349 | CGU | 54, 111 | NC | NC |
| G | GGU/GGC | GGC | GGA | GGA | GGC | NC | NC | NC | NC | NC | NC | GGU | 56, 86, 177, 319, 402 |

NC: no 100% conserved amino acid positions coded by the preferred codon; NF: amino acid not found in the sequence.
Table 2

The mean number of amino acid residues in the sequences used for this study from the Avian coronavirus spike (S), nucleocapsid (N), non-structural protein 2 (NSP2) and
papain-like protease (PLpro) genes coded by a preferred codon and the preferred codon for each aa in the Gallus gallus beta-actin (B-act), lung surfactant protein A (SFTPA1),
intestinal cholecystokinin (CCK), oviduct ovomucin alpha subunit (Ovo) and kidney vitamin D receptor (VitD rec) genes.

Columns S to PLpro: Avian coronavirus; columns B-act to VitD rec: G. gallus.

| Amino acid | S | N | Nsp2 | PLpro | B-act | CCK | Ovo | SFTPA1 | VitD rec |
| F | 9.2 | 3 | 15.9 | 21.8 | 13 | 3 | 87 | 5 | 24 |
| L | 14.6 | 4.28 | 26.7 | 37.7 | 27 | 12 | 111 | 27 | 43 |
| I | 5.8 | 2.04 | 13.1 | 19.3 | 28 | 5 | 112 | 8 | 20 |
| V | 14.3 | 7.08 | 23.1 | 38.3 | 22 | 5 | 137 | 10 | 22 |
| S | 19.8 | 7.24 | 15.2 | 33.3 | 25 | 16 | 163.5 | 14 | 45 |
| P | 7.0 | 9.96 | 8.1 | 14.7 | 19 | 8 | 116 | 9 | 24 |
| T | 13.1 | 5.52 | 13.6 | 26.6 | 26 | 4 | 147 | 11 | 19 |
| A | 14.2 | 5.6 | 28.0 | 36.3 | 29 | 13 | 85.5 | 15 | 24.5 |
| Y | 11.0 | 1.2 | 3.0 | 19.1 | 15 | 5 | 76 | 12 | 7 |
| H | 5.3 | 0 | 1.0 | 5.3 | 9 | 4 | 46.5 | 1 | 13 |
| Q | 7A | 6.12 | 13.9 | 11.9 | 12 | 8 | 77.5 | 12 | 21 |
| N | 12.6 | 5.8 | 3.9 | 28.1 | 9 | 2 | 103.5 | 14 | 13 |
| K | 6.0 | 11.48 | 21.2 | 33.7 | 19 | 2 | 133.5 | 14 | 28 |
| D | 2.6 | 13.6 | 12.2 | 28.5 | 23 | 6 | 118 | rd | 33 |
| E | 1.8 | 9.96 | 12.9 | 22.3 | 26 | 6 | 131 | 19 | 33.5 |
| C | 7.7 | 2.16 | 5.0 | 11.9 | 6 | 2 | 201 | 8 | 13 |
| R | 2.9 | 8.24 | 11.8 | IB | 18 | 10.5 | 60 | 7 | 26 |
| G | 11.4 | 4.56 | 9.1 | 22.3 | 28 | 13 | 142 | 20 | 18 |
| M | 4.3 | 0 | 5.4 | 2.3 | 17 | 3 | 37 | 5 | 22 |
| W | 2.6 | 1 | 2.0 | 10.0 | 4 | 1.5 | 23 | 4 | 2 |


of 21.7, 32.8 and 37.8 when compared to N, NSP2 and PLpro genes,
respectively, while differences among N, NSP2 and PLpro were not
significant (p > 0.05). The mean number of amino acids
for each sequence is shown in Table 2.
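The comparison between S and each of the other genes can be sketched with a 2 × 2 table and a two-sided Fisher's exact test written with the standard library. The table layout is an assumption: each gene is described by its number of conserved positions coded by a preferred codon and by the remaining codons of the region analyzed (169, 109, 245 and 435 codons for S, N, NSP2 and PLpro, taken from Section 2.1.1). The text does not state which denominators were used, and the function names are hypothetical.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Odds ratio of the 2 x 2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(
        prob(x) for x in range(lowest, highest + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )


# Assumed layout: (conserved preferred positions, remaining codons of the region).
S_ROW = (1, 169 - 1)
OTHER_ROWS = {"N": (20, 109 - 20), "NSP2": (28, 245 - 28), "PLpro": (71, 435 - 71)}

results = {
    gene: (odds_ratio(*row, *S_ROW), fisher_exact_two_sided(*row, *S_ROW))
    for gene, row in OTHER_ROWS.items()
}
```

Under this assumed layout, the odds ratios evaluate to about 37.8 for N, 21.7 for NSP2 and 32.8 for PLpro, with p < 0.0001 against S in each case. These are the same three magnitudes quoted in the text, although the pairing of values with genes depends on the table layout actually used.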

The number of amino acids in the G. gallus proteins that pre- 
sented the same codons used by at least one of the A. coronavirus 
genes in 100% conserved aa positions ranged from 1 (for vitamin 
D receptor) to 15 (for ovomucin alpha), and the most conserved 
preferred codon, found for all A. coronavirus genes, was UUU for F 
(Table 1). The positions of each of the conserved amino acids coded 
by preferred codons for A. coronavirus are also shown in Table 1. 


3.6. Protein selection test 

The sequences of N, NSP2 and PLpro from all the strains
in this study were found to be under purifying selection as
the p values from Fisher's exact test were all above 0.05, with
mean values of 0.99 for each gene and sd values of 0.06,
0.05 and 0.08, respectively. For S sequences, the mean p value
was 0.97 (sd 0.13), but p values <0.05 were found between

Fig. 3. Boxplot distribution for the codon adaptation index (CAI; Y axis, 0.6 to 1.0) for Avian coronavirus
spike (S), nucleocapsid (N), non-structural protein 2 (NSP2) and papain-like
protease (PLpro) and Gallus gallus beta-actin, lung surfactant protein A, intestinal
cholecystokinin, oviduct ovomucin alpha subunit and kidney vitamin D receptor
(represented together in a single boxplot).


the groups of sequences FJ899690.1 Conn39528/FJ899689.1 
Conn32062/FJ904716.1 Conn461996/FJ904717.1 Conn46197 and 
AY561711.1 M41/DQ834384.1 M41, indicating positive selection 
for these strains. 


4. Discussion 


Regardless of the gene being considered, all A. coronavirus 
sequences segregated in an exclusive cluster in the RSCU tree, 
which, despite being consistently separate from the G. gallus cluster,
was closer to the SFTPA1 (a gene expressed in the respiratory
tract of chicken) cluster. Taking the codon usage for these genes as 
a reflection of the codon usage in the respiratory tract, both struc- 
tural and non-structural genes show a codon usage closer to the 
chicken respiratory tissue translational environment than to the 
reproductive, renal and enteric ones. 

This similar codon usage could allow for an improved viral repli- 
cation in the respiratory tract as a first site of viral replication, a 
feature common to all A. coronavirus strains in chickens, before the 
virus reaches other replication sites for each pathotype, as a result 
of the natural selection for codons and a more efficient translation 
of virus proteins, as already suggested for the S gene alone (Brandao, 
2012). 

Evidence of natural selection for codon usage as an evolutionary
force acting upon A. coronavirus was found in the Nc × GC3% graphs
(Fig. 2) because, for all four viral genes, observed GC3% points fell
outside the curve, indicating that codon usage for all the strains
under analysis was not the sole result of the random accumulation
of mutations.

Nonetheless, the Nc × GC3% plots show that A. coronavirus codon
usage could also be a consequence of mutation pressure, as the 
points were in the vicinity of the curve, meaning that the GC% at 
the synonymous 3rd codon position follows the viral genomic GC% 
to some degree. 

It must be considered that both genetic drift derived from the 
mutation pressure and natural selection detected for A. coronavirus 
could also harbor some relationship with genomic RNA secondary 
structure constraints and not only codon usage, as synonymous 
3rd base mutations, though synonymous in terms of amino acid 
codification, could result in altered RNA secondary structure 
(Cardinale et al., 2013) and, consequently, impaired viral transcription,
replication and assembly. As signals for RNA replication and
genome packaging in coronaviruses are RNA secondary structure- 
dependent (Narayanan and Makino, 2007; Williams et al., 1999), 
such structures must be under intense evolutionary constraints 
that balance with codon usage evolution. 

From the host side, mutation bias has also been shown to be the 
major driving force of G. gallus codon usage evolution, with minor 
participation of natural selection (Rao et al., 2011), in agreement 
with the results presented herein, suggesting a common evolutionary
path for both virus and host.

A marked difference was noticed regarding the degree of codon
usage bias for each A. coronavirus gene studied: for S, N and PLpro,
all mean Nc values were just above 40, indicating a moderate bias (Gu
et al., 2004), but for NSP2, the mean Nc (51.33) indicated a lower
codon usage bias. These results provide evidence that A. coronavirus
genes have taken different codon evolution pathways depending on
the function that each protein possesses.

The function of NSP2 is still not clearly defined, but a role has 
been suggested as a co-factor for RNA synthesis (Graham et al., 
2005), possibly in the early stages of virus replication. Despite the 
limited number of studies on NSP2 evolution, it can be speculated 
that a less biased codon usage for a protein involved in early stages 
of viral replication would allow for a less restricted tRNA preference 
and thus a more efficient start to the viral cycle. 

The finding that the most biased gene was S (mean Nc=43) 
might be linked to its relationship with the G. gallus immune sys- 
tem. The spike protein is the main target for neutralizing antibodies, 
and thus, theoretically, the more S protein that is expressed, the 
higher the generation of a humoral immune response against S and 
the lower cell infection by A. coronavirus. 

Considering this stronger codon bias of S, the fact that S showed 
the lowest CAI value when compared to the other three genes and 
the fact that genes with lower CAIs are expressed less efficiently 
(Roth et al., 2012), a deoptimization of S expression could have 
been selected for with the advantage of lower S expression, pro- 
viding further evidence that viral proteins that participate in host 
recognition might have a codon usage less similar to that presented 
by the host (Bahir et al., 2009). 

Regarding CAI values for N, NSP2 and PLpro, Fig. 3 suggests that
the distributions were mostly above those for S, with the highest 
values for N (0.75-0.79). N protein plays a chief role in nucleocapsid 
assembly that is dependent on the association of positively charged 
amino acids with the genomic RNA of coronaviruses (Masters, 
2006) and is thus under strong purifying selection, as shown herein 
by the Fisher's exact test on dS-dN values. Optimization of the
codon usage in a manner closer to that of the host would endow 
A. coronavirus with a more efficient and accurate synthesis of the 
nucleocapsid protein. 

The distribution of CAIs for NSP2 and PLpro stayed between those
for N and S (Fig. 3). Considering that PLpro is a protease acting
on the N-terminal domains of replicase polyproteins pp1a and
pp1ab (Ziebuhr et al., 2000), an intermediate adaptation to the
host's translational environment could have evolved as a balance
between the conservation of structure of the enzymatic domain
and the plasticity to follow amino acid mutations occurring on the
PLpro cleavage sites of diverse A. coronavirus types as compensatory
mutations, showing that epistasis could also be detected at the
codon usage evolution level.

It is noteworthy that none of the A. coronavirus strains showed
a simultaneous occurrence of maximum/minimum CAI and Nc
(data not shown) for any of the four
genes, meaning that CAI and Nc might be driven along different
evolutionary pathways and that strains with a high CAI, i.e., highly
adapted to the host's translational environment, are not necessarily
the ones with the lower bias, i.e., with higher Nc.


The distribution of 100% conserved amino acid positions coded
by the preferred codon is noteworthy when one compares the S
gene with N, NSP2 or PLpro, as a single position was found in a region
outside antigenic and hypervariable regions (Cavanagh et al., 1988;
Kant et al., 1992) in the S gene, while for the other three genes, these
positions (n = 20, 28 and 71, respectively) were scattered throughout
the regions considered, with statistically significant differences
when compared to S (p < 0.0001, OR = 21.7-37.8).

This low number of conserved amino acid positions coded by the 
preferred codon in S could be an additional molecular evolution- 
ary mechanism for S antigenic diversity, as fine-tuning translation 
kinetics could result in high deoptimization of codon usage and a 
consequent increased fitness (Aragonés et al., 2010). 

On the other hand, possibly due to strong structural and functional
constraints, N, NSP2 and PLpro have a higher number of amino
acid positions coded by the preferred codon, which is the same
codon preferred by the host (Table 1), which would allow higher
fitness to the host translational environment (Zhou et al., 2012) in
a concerted virus-host molecular evolution.

Thus, taking conserved amino acid positions coded by the
preferred codons as a selection unit, it follows from the above-mentioned
differences that natural selection could either be positive for
these positions, leading a protein under purifying selection (e.g., N,
NSP2 and PLpro) to show the same codons as the host for that amino
acid, or negative if a protein is under positive selection (as shown
for S).

The most probable reason for the fact that non-100% conserved 
amino acid positions coded by a preferred codon for that amino acid 
(noted as NC in Table 1) were only found in A. coronavirus genes and
not in the G. gallus genes is that host genes are less susceptible to 
both the occurrence of putative amino acids and codon usage poly- 
morphisms, contrary to what is observed and expected for virus 
genes. 

Nc might be considered to be an accurate indicator of codon 
usage bias because the frequency of amino acids is normalized dur- 
ing the analysis and does not add bias; however, similarly to the 
CAI, the outcome of the Nc analysis is a single number, leading to 
a loss of deep evolutionary information similar to the loss of evo- 
lutionary information in nucleotide or amino acid distance-based 
phylogenetic analyses. 

Taking into account informative sites during codon evolution 
studies, for instance, 100% conserved amino acid positions coded 
by the preferred codon for that amino acid, could unveil data that 
would otherwise be lost in the analysis and that could be used to 
gain a more comprehensive understanding of molecular evolution 
in association with the codon usage bias indicators and selection 
analysis. 

It would be interesting to use the analyses presented herein not 
only for a better understanding of virus evolution but also as supporting
predictors of spill-over events, such as influenza (Wahlgren,
2011) and the new human coronavirus (Kindler et al., 2013) now 
named MERS-CoV, for which the role of codon usage evolution in 
virus adaptation to new hosts has been widely ignored. 

In conclusion, A. coronavirus codon usage evolves independently 
for each gene in a manner predictable by the protein function. 
Proteins with high functional and structural constraints are more 
adapted to G. gallus, its natural host, with a balance between natu- 
ral selection and mutation pressure, giving further molecular basis 
for the virus’ ability to exploit the host’s environment. 


Appendix A. Supplementary data


Supplementary data associated with this article can be
found, in the online version, at http://dx.doi.org/10.1016/j.virusres.2013.09.033.